{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "marker = '↑'\n",
    "marker = '*'\n",
    "\n",
    "def _mark(match):\n",
    "    start, end = match.span()\n",
    "    return marker * (end - start)\n",
    "\n",
    "def show_regex_match(regex, text):\n",
    "    \"\"\"\n",
    "    Prints the string with the regex match highlighted.\n",
    "    \"\"\"\n",
    "    res = re.sub(f'({regex})', r'\\033[1;30;43m\\1\\033[m', text)\n",
    "    matches = ''.join([marker if c == marker else ' '\n",
    "                       for c in re.sub(regex, _mark, text)])\n",
    "    print(f'  Regex: {regex}')\n",
    "    print(f'   Text: {res}')\n",
    "    print(f'Matches: {matches}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "log_entry = r'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]\"GET /stat141/Winter04 HTTP/1.1\" 301 328 \"http://anson.ucdavis.edu/courses\"\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)\"'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "lLVVSGne3AIY"
   },
   "source": [
    "# Regular Expressions\n",
    "\n",
    "*Regular expressions* (or *regex* for short) are special patterns that we use\n",
    "to match parts of strings.\n",
    "Think about the format of a Social Security number (SSN) like `134-42-2012`.\n",
    "To describe this format, we \n",
    "might say that SSNs consist of three digits, then a dash, two digits, another\n",
    "dash, then four digits. Regexes let us capture this pattern in code. \n",
    "Regexes give us a compact and powerful way to describe this pattern of digits and dashes. The syntax of regular expressions is fortunately quite\n",
    "simple to learn; we introduce nearly all of the syntax in this section alone.\n",
    "\n",
    "As we introduce the concepts, we tackle some of the examples described in an\n",
    "earlier section and show how to carry out the tasks with regular expressions.\n",
    "Almost all programming languages have a library to match patterns using regular\n",
    "expressions, making regular expressions useful in any programming language. We use some of the common methods available in the Python\n",
    "built-in `re` module to accomplish the tasks from the examples. These methods\n",
    "are summarized in {numref}`Table %s <regex-methods>` at the end of this section, where the basic usage and\n",
    "return value are briefly described. Since we only cover a few of the most\n",
    "commonly used methods, you may find it useful to consult [the official\n",
    "documentation on the `re` module][re] as well.\n",
    "\n",
    "[re]: https://docs.python.org/3/library/re.html"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regular expressions are based on searching a string one character (aka\n",
    "*literal*) at a time for a pattern. We call this notion *concatenation of\n",
    "literals*."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "744CBmB07MUn"
   },
   "source": [
    "## Concatenation of Literals"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenation is best explained with a basic example. Suppose we are looking\n",
    "for the pattern `cat` in the string `cards scatter!`. Here's how the computer matches a literal pattern like `cat`:\n",
    "\n",
    "1. Begin with the first character in the string (`c`).\n",
    "1. Does the character in the string match the first character in the pattern (`c`)?\n",
    "1. No: move on to the next character of the string. Return to step 2.\n",
    "1. Yes: does the rest of the pattern match (`a`, then `t`)?\n",
    "1. No: move on to the next character of the string (one character past the location in step 2). Return to step 2. \n",
    "1. Yes: report that the pattern was found.\n",
    "\n",
    "\n",
    "{numref}`Figure %s <fig:regex-literals>` contains a diagram of\n",
    "the idea behind this search through the string\n",
    "one character at a time. The pattern \"cat\" is found within the string\n",
    "`cards scatter!` in positions 8–10. Once you get the hang of this process, you can\n",
    "move on to the richer set of patterns; they all follow this basic\n",
    "paradigm."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} figures/regex-literals.svg\n",
    "---\n",
    "name: fig:regex-literals\n",
    "---\n",
    "\n",
    "To match literal patterns, the regex engine moves along the string and checks one literal at a time for a match of the entire pattern. Notice that the pattern is found within the word `scatters` and that a partial match is found in `cards`.\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "In the preceding example, we observe that regular expressions can match patterns\n",
    "that appear anywhere in the input string. In Python, this behavior differs\n",
    "depending on the method used to match the regex—some methods only return a\n",
    "match if the regex appears at the start of the string; other methods return a\n",
    "match anywhere in the string.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "JWMIE3Dz3AI4"
   },
   "source": [
    "These richer patterns are made of character classes and metacharacters like wildcards. We describe them in the subsections that follow. \n",
    "\n",
    "### Character classes\n",
    "We can make patterns more flexible by using a *character class* \n",
    "(also known as a *character set*), which\n",
    "lets us specify a collection of equivalent characters to match. This allows us\n",
    "to create more relaxed matches. To create a character class, wrap the set of\n",
    "desired characters in brackets `[ ]`. \n",
    "For example, the pattern `[0123456789]` means \"match any literal\n",
    "within the brackets\"---in this case, any single digit.\n",
    "Then, the following regular expression matches three digits:\n",
    "\n",
    "```\n",
    "[0123456789][0123456789][0123456789]\n",
    "```\n",
    "\n",
    "This is such a commonly used character class that\n",
    "there is a shorthand notation for the range of digits, `[0-9]`. Character\n",
    "classes allow us to create a regex for SSNs:\n",
    "\n",
    "```\n",
    "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]\n",
    "```\n",
    "\n",
    "Two other ranges that are commonly used in character classes are `[a-z]` for\n",
    "lowercase and `[A-Z]` for uppercase letters. We can combine ranges with other\n",
    "equivalent characters and use partial ranges. For example, `[a-cX-Z27]` is\n",
    "equivalent to the character class `[abcXYZ27]`. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     },
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 282,
     "status": "ok",
     "timestamp": 1524681499797,
     "user": {
      "displayName": "Andrew Kim",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
      "userId": "109388192039284916411"
     },
     "user_tz": 420
    },
    "id": "q21x1CHU3AI6",
    "outputId": "95041818-9995-4b5b-8993-6cd18747d1de"
   },
   "source": [
    "Let's return to our original pattern `cat` and modify it to include two character classes:\n",
    "\n",
    "```\n",
    "c[oa][td]\n",
    "```\n",
    "\n",
    "This pattern matches `cat`, but it also matches `cot`, `cad`, and `cod`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Regex: c[oa][td]\n",
      "   Text: The \u001b[1;30;43mcat\u001b[m eats \u001b[1;30;43mcod\u001b[m, \u001b[1;30;43mcad\u001b[ms, and \u001b[1;30;43mcot\u001b[ms, but not coats.\n",
      "Matches:     ***      ***  ***       ***                 \n"
     ]
    }
   ],
   "source": [
    "show_regex_match(r'c[oa][td]', 'The cat eats cod, cads, and cots, but not coats.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea of moving through the string one character at a time still remains the core notion, but now there's a bit more flexibility in which literal is considered a match."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "BvMwjXJS3AKI"
   },
   "source": [
    "### Wildcard character\n",
    "When we really don't care what the literal is, we can specify this with `.`, the\n",
    "period character. This matches any character except a newline."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "CUDgfZsR3AJK"
   },
   "source": [
    "### Negated character classes\n",
    "A *negated character class* matches any character\n",
    "*except* those between the square brackets. To create a negated character\n",
    "class, place the caret symbol as the first character after the left square\n",
    "bracket. For example, `[^0-9]` matches any character except a digit."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shorthands for character classes\n",
    "Some character sets are so common that there are shorthands for them. For example, `\\d` is short for `[0-9]`. We can use this shorthand to simplify our search for SSN:\n",
    "\n",
    "```\n",
    "\\d\\d\\d-\\d\\d-\\d\\d\\d\\d\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our regular expression for SSNs isn't quite bulletproof. If the string has extra digits at the beginning or end of the pattern we're looking for, then we still get a match. Note that we add the `r` character before the quotes to create a raw string, which makes regexes easier to write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Regex: \\d\\d\\d-\\d\\d-\\d\\d\\d\\d\n",
      "   Text: My other number is 6\u001b[1;30;43m382-13-3842\u001b[m0.\n",
      "Matches:                     ***********  \n"
     ]
    }
   ],
   "source": [
    "show_regex_match(r'\\d\\d\\d-\\d\\d-\\d\\d\\d\\d', 'My other number is 6382-13-38420.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can remedy the situation with a different sort of metacharacter: one that matches a word boundary. \n",
    "\n",
    "### Anchors and boundaries\n",
    "At times we want to match a position before, after, or between characters. One example is to locate the beginning or end of a string; these are called _anchors_. Another is to locate the beginning or end of a word, which we call a _boundary_. The metacharacter `\\b` denotes the boundary of a word. It has 0 length, and it matches whitespace or punctuation on the boundary of the pattern. We can use it to fix our regular expression for SSNs: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Regex: \\d\\d\\d-\\d\\d-\\d\\d\\d\\\n",
      "   Text: My other number is 6382-13-38420.\n",
      "Matches:                                  \n"
     ]
    }
   ],
   "source": [
    "show_regex_match('r\\b\\d\\d\\d-\\d\\d-\\d\\d\\d\\d\\b', 'My other number is 6382-13-38420.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Regex: \\b\\d\\d\\d-\\d\\d-\\d\\d\\d\\d\\b\n",
      "   Text: My reeeal number is \u001b[1;30;43m382-13-3842\u001b[m.\n",
      "Matches:                     *********** \n"
     ]
    }
   ],
   "source": [
    "show_regex_match(r'\\b\\d\\d\\d-\\d\\d-\\d\\d\\d\\d\\b', 'My reeeal number is 382-13-3842.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Escaping metacharacters\n",
    "We have now seen several special characters, called *metacharacters*: `[` and\n",
    "`]` denotes a character class, `^` switches to a negated character class, `.`\n",
    "represents any character, and `-` denotes a range. But sometimes we might want\n",
    "to create a pattern that matches one of these literals. When this happens, we\n",
    "must escape it with a backslash. For example, we can match the literal left\n",
    "bracket character using the regex `\\[`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     },
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 303,
     "status": "ok",
     "timestamp": 1524681067395,
     "user": {
      "displayName": "Andrew Kim",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
      "userId": "109388192039284916411"
     },
     "user_tz": 420
    },
    "id": "n6NNGiO93AKJ",
    "outputId": "6248e853-8c58-4362-9004-a5028c7e5218",
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Regex: \\[\n",
      "   Text: Today is \u001b[1;30;43m[\u001b[m2022/01/01]\n",
      "Matches:          *           \n"
     ]
    }
   ],
   "source": [
    "show_regex_match(r'\\[', 'Today is [2022/01/01]')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we show how quantifiers can help create a more compact and clear\n",
    "regular expression for SSNs. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "cE-nOxzl3AJP"
   },
   "source": [
    "## Quantifiers\n",
    "\n",
    "To create a regex to match SSNs, we wrote:\n",
    "\n",
    "```\n",
    "\\b[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]\\b\n",
    "```\n",
    "\n",
    "This matches a \"word\" consisting of three digits, a dash, two more digits, a dash, and four more digits.\n",
    "\n",
    "Quantifiers allow us to match multiple consecutive appearances of a literal. We\n",
    "specify the number of repetitions by placing the number in curly braces `{ }`. \n",
    "\n",
    "Let's use Python's built-in `re` module for matching this pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     },
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 293,
     "status": "ok",
     "timestamp": 1524681487610,
     "user": {
      "displayName": "Andrew Kim",
      "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
      "userId": "109388192039284916411"
     },
     "user_tz": 420
    },
    "id": "R_5z2Idd2q5Z",
    "outputId": "ea92182c-652a-4f3f-d556-0619797074e0"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['382-34-3840']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "ssn_re = r'\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b'\n",
    "re.findall(ssn_re, 'My SSN is 382-34-3840.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our pattern shouldn't match phone numbers. Let's try it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(ssn_re, 'My phone is 382-123-3842.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MD0XqVY73AJX"
   },
   "source": [
    "A quantifier always modifies the character or character class to its immediate\n",
    "left. {numref}`Table %s <quantifier-ex>` shows the complete syntax for quantifiers."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4HXw_UbI3AJY"
   },
   "source": [
    ":::{table} Quantifier examples\n",
    ":name: quantifier-ex\n",
    "\n",
    "Quantifier | Meaning\n",
    "--- | ---\n",
    "{m, n} | Match the preceding character m to n times.\n",
    "{m} | Match the preceding character exactly m times.\n",
    "{m,} | Match the preceding character at least m times.\n",
    "{,n} | Match the preceding character at most n times.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "h6_CUONr3AJa"
   },
   "source": [
    "Some commonly used quantifiers have a shorthand, as shown in {numref}`Table %s <short-quantifiers>`.\n",
    "\n",
    ":::{table} Shorthand quantifiers\n",
    ":name: short-quantifiers\n",
    "\n",
    "Symbol | Quantifier | Meaning\n",
    "--- | --- | ---\n",
    " `*` | {0,} | Match the preceding character 0 or more times.\n",
    " `+` | {1,} | Match the preceding character 1 or more times.\n",
    " `?` | {0,1} | Match the preceding character 0 or 1 time.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "16ZfQsaqInak"
   },
   "source": [
    "Quantifiers are greedy and will return the longest match possible. This sometimes results in\n",
    "surprising behavior. Since an SSN starts and ends with a digit, we might think\n",
    "the following shorter regex will be a simpler approach for finding SSNs.  Can\n",
    "you figure out what went wrong in the matching?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['382-34-3842 and hers is 382-34-3333']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ssn_re_dot = r'[0-9].+[0-9]'\n",
    "re.findall(ssn_re_dot, 'My SSN is 382-34-3842 and hers is 382-34-3333.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that we use the metacharacter `.` to match any character. In many cases, using a more specific character class prevents these false \"over-matches.\" Our earlier pattern that includes word boundaries does this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['382-34-3842', '382-34-3333']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(ssn_re, 'My SSN is 382-34-3842 and hers is 382-34-3333.')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some platforms allow you to turn off greedy matching and use *lazy* matching, which returns the shortest string.\n",
    "\n",
    "Literal concatenation and quantifiers are two of the core concepts in regular\n",
    "expressions. Next, we introduce two more core concepts: alternation and grouping."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alternation and Grouping to Create Features\n",
    "\n",
    "Character classes let us match multiple options for a single literal.\n",
    "We can use alternation to match multiple options for a group of literals.\n",
    "For instance, in the food safety example in \n",
    "{numref}`Chapter %s <ch:wrangling>`, we marked violations related to\n",
    "body parts by seeing if the violation had the substring\n",
    "`hand`, `nail`, `hair`, or `glove`.\n",
    "We can use the `|` character in a regex to specify this alteration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hand', 'glove']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "body_re = r\"hand|nail|hair|glove\"\n",
    "re.findall(body_re, \"unclean hands or improper use of gloves\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hair', 'nail']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(body_re, \"Unsanitary employee garments hair or nails\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With parentheses we can locate parts of a pattern, which are called *regex groups*. For example, we can use regex groups to extract the day, month, year, and time from the web server log entry:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['[26/Jan/2004:10:47:58 -0800]']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This pattern matches the entire timestamp\n",
    "time_re = r\"\\[[0-9]{2}/[a-zA-z]{3}/[0-9]{4}:[0-9:\\- ]*\\]\"\n",
    "re.findall(time_re, log_entry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('26', 'Jan', '2004', '10:47:58 -0800')]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Same regex, but we use parens to make regex groups...\n",
    "time_re = r\"\\[([0-9]{2})/([a-zA-z]{3})/([0-9]{4}):([0-9:\\- ]*)\\]\"\n",
    "\n",
    "# ...which tells findall() to split up the match into its groups\n",
    "re.findall(time_re, log_entry)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, `re.findall` returns a list of tuples containing the individual\n",
    "components of the date and time of the web log.\n",
    "\n",
    "We have introduced a lot of terminology, so\n",
    "in the next section we bring it all together into a set of tables for easy reference. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TrPjIHHD8rOC"
   },
   "source": [
    "## Reference Tables\n",
    "\n",
    "We conclude this section with a few tables that summarize order of\n",
    "operation, metacharacters, and shorthands for character classes.\n",
    "We also provide tables summarizing the handful of methods in the `re`\n",
    "Python library that we have used in this section."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TrPjIHHD8rOC"
   },
   "source": [
    "The four basic operations for regular expressions—concatenation, quantifying,\n",
    "alternation, and grouping—have an order of precedence, which we make explicit\n",
    "in {numref}`Table %s <regex-order>`. \n",
    "\n",
    ":::{table} Order of operations\n",
    ":name: regex-order\n",
    "\n",
    "| Operation  | Order | Example   | Matches | \n",
    "| ---------- | ----- | --------- | ------- |\n",
    "| Concatenation | 3  | `cat`     |  `cat`    | \n",
    "| Alternation   | 4  | `cat\\|mouse` |  `cat` and `mouse` | \n",
    "| Quantifying   | 2  | `cat?`  | `ca` and `cat`  |  \n",
    "| Grouping      | 1  |  c(at)? |  `c` and `cat`    | \n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TrPjIHHD8rOC"
   },
   "source": [
    "{numref}`Table %s <regex-meta>` provides a list of the metacharacters introduced in this\n",
    "section, plus a few more. The column labeled \"Doesn't match\" gives examples of\n",
    "strings that the example regexes don't match.\n",
    "\n",
    ":::{table} Metacharacters\n",
    ":name: regex-meta\n",
    "\n",
    "\n",
    "| Char   | Description                         | Example                    | Matches        | Doesn't match |\n",
    "| ------ | ----------------------------------- | -------------------------- | -------------- | ------------- |\n",
    "| .      | Any character except \\n             | `...`                      | abc            | ab    |\n",
    "| [ ]    | Any character inside brackets       | `[cb.]ar`                  | car<br>.ar     | jar           |\n",
    "| [^ ]   | Any character _not_ inside brackets | `[^b]ar`                   | car<br>par     | bar<br>ar     |\n",
    "| \\*     | ≥ 0 or more of previous symbol, shorthand for {0,}          | `[pb]*ark`                 | bbark<br>ark   | dark          |\n",
    "| +      | ≥ 1 or more of previous symbol, shorthand for {1,}           | `[pb]+ark`                 | bbpark<br>bark | dark<br>ark   |\n",
    "| ?      | 0 or 1 of previous symbol, shorthand for {0,1}               | `s?he`                     | she<br>he      | the           |\n",
    "| {_n_}  | Exactly _n_ of previous symbol          | `hello{3}`                 | hellooo        | hello         |\n",
    "| &#124; | Pattern before or after bar         | <code>we&#124;[ui]s</code> | we<br>us<br>is | es<br>e<br>s        |\n",
    "| \\      | Escape next character              | `\\[hi\\]`                   | [hi]           | hi            |\n",
    "| ^      | Beginning of line                   | `^ark`                     | ark two        | dark          |\n",
    "| \\$     | End of line                         | `ark$`                     | noahs ark      | noahs arks    |\n",
    "| \\\\b    | Word boundary                       | `ark\\b`                    | ark of noah    | noahs arks    |\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TrPjIHHD8rOC"
   },
   "source": [
    "Additionally, in {numref}`Table %s <regex-shorthand>`, we provide shorthands for some commonly used character\n",
    "sets. These shorthands don't need `[ ]`.\n",
    "\n",
    ":::{table} Character class shorthands \n",
    ":name: regex-shorthand\n",
    "\n",
    "| Description                   | Bracket form       | Shorthand |\n",
    "| ----------------------------- | ------------------ | --------- |\n",
    "| Alphanumeric character        | `[a-zA-Z0-9_]`      | `\\w`      |\n",
    "| Not an alphanumeric character | `[^a-zA-Z0-9_]`     | `\\W`      |\n",
    "| Digit                         | `[0-9]`            | `\\d`      |\n",
    "| Not a digit                   | `[^0-9]`           | `\\D`      |\n",
    "| Whitespace                    | `[\\t\\n\\f\\r\\p{Z}]`  | `\\s`      |\n",
    "| Not whitespace                | `[^\\t\\n\\f\\r\\p{z}]` | `\\S`      |\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used the following methods in `re` in this chapter. The names of the methods\n",
    "are indicative of the functionality they perform: *search* or\n",
    "*match* a pattern in a\n",
    "string; *find all* cases of a pattern in a string; *sub*stitute all occurrences\n",
    "of a pattern with a substring; and *split* a string into pieces at the pattern.\n",
    "Each requires a pattern and string to be specified, and some have\n",
    "extra arguments.\n",
    "{numref}`Table %s <regex-methods>` provides the format of the method usage and a\n",
    "description of the return value. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{table} Regular expression methods\n",
    ":name: regex-methods\n",
    "\n",
    "| Method                                 | Return value |\n",
    "| -------------------------------------- | ------------------ |\n",
    "| `re.search(pattern, string)`           | Truthy match object if the pattern is found anywhere in the string, otherwise `None` |\n",
    "| `re.match(pattern, string)`           | Truthy match object if the pattern is found at the beginning of the string, otherwise `None` |\n",
    "| `re.findall(pattern, string)`          | List of all matches of `pattern` in `string`|\n",
    "| `re.sub(pattern, replacement, string)` | String where all occurrences of `pattern` are replaced by  `replacement` in the `string` |\n",
    "| `re.split(pattern, string)`            | List of the pieces of `string` around the occurrences of `pattern` |\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we saw in the previous section, `pandas` `Series` objects have a `.str` property\n",
    "that supports string manipulation using Python string methods. Conveniently,\n",
    "the `.str` property also supports some functions from the `re` module. {numref}`Table %s <regex-pandas>` shows the analogous functionality from {numref}`Table %s <regex-methods>` of the `re`\n",
    "methods. Each requires a pattern.\n",
    "See [the `pandas` docs][pd_str] for a complete list of \n",
    "string methods. \n",
    "\n",
    "[pd_str]: https://pandas.pydata.org/pandas-docs/stable/text.html"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{table} Regular expressions in `pandas`\n",
    ":name: regex-pandas\n",
    "\n",
    "| Method                                          | Return value                                                                |\n",
    "| ----------------------------------------------- | --------------------------------------------------------------------------- |\n",
    "| `str.contains(pattern, regex=True)`             | Series of booleans indicating whether the `pattern` is found                |\n",
    "| `str.findall(pattern, regex=True)`              | List of all matches of `pattern`                                            |\n",
    "| `str.replace(pattern, replacement, regex=True)` | Series with all matching occurrences of `pattern` replaced by `replacement` |\n",
    "| `str.split(pattern, regex=True)`                | Series of lists of strings around given `pattern`                           |\n",
    "\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regular expressions are a powerful tool, but they're somewhat notorious for being difficult to read and debug. We close with some advice for using regexes:  \n",
    "\n",
    "+ Develop your regular expression on simple test strings to see what the pattern matches.\n",
    "+ If a pattern matches nothing, try weakening it by dropping part of the pattern. Then tighten it incrementally to see how the matching evolves. (Online regex checking tools can be very helpful here.)\n",
    "+ Make the pattern only as specific as it needs to be for the data at hand.\n",
    "+ Use raw strings whenever possible for cleaner patterns, especially when a pattern includes a backslash. \n",
    "+ When you have lots of long strings, consider using compiled patterns because they can be faster to match (see `compile()` in the `re` library)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next section, we carry out an example text analysis.\n",
    "We clean the data\n",
    "using regular expressions and string manipulation, convert the text into\n",
    "quantitative data, and analyze the text via these derived quantities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "RTJu-FQY3AIp",
    "TrPjIHHD8rOC"
   ],
   "default_view": {},
   "name": "working_with_text.ipynb",
   "provenance": [],
   "version": "0.3.2",
   "views": {}
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
